Artificial intelligence (AI) is revolutionizing weather forecasts, improving
both their accuracy and computational efficiency. However, these models have a
fatal flaw. The measure of an AI forecasting model's quality is its average accuracy
across all gridpoints over the globe. This approach is in line with the mathematical
roots of the AI field, but it fails to capture the real-world impacts that drive our
desire for accurate weather forecasting. To understand why, let's take a look at
where on Earth model performance is worst. We will investigate
GraphCast,
Google's state-of-the-art
deterministic AI model for weather forecasting, and consider its ability to predict
atmospheric temperature 3 days in advance, a common benchmark for models.
The go-to metric in assessing the performance of these predictions is the root
mean square error (RMSE). Typically, RMSE is both temporally and geospatially averaged,
meaning you get one number to report as the quality of your model. Convenient. We will
begin by eliminating the reduction over the geospatial dimension (latitude and
longitude), allowing us to see the RMSE at each individual 1.5° by 1.5° cell across Earth.
In addition to sticking with GraphCast, we will consistently assess model performance
on twice-daily temperature values throughout 2020. GraphCast predictions
were retrieved from WeatherBench 2,
and temperature data from ECMWF's ERA5 dataset,
made available on the Copernicus Climate Data Store.
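The idea of dropping the geospatial reduction can be sketched with plain NumPy. The arrays and shapes below are illustrative placeholders, not the actual WeatherBench 2 or ERA5 data: synthetic forecasts and "truth" on a 1.5° grid with 732 twice-daily steps for 2020.

```python
import numpy as np

# Hypothetical (time, lat, lon) arrays standing in for model forecasts
# and ERA5 ground truth: 732 twice-daily steps, 121 x 240 grid cells.
rng = np.random.default_rng(0)
forecast = rng.normal(288.0, 5.0, size=(732, 121, 240))
truth = forecast + rng.normal(0.0, 1.5, size=forecast.shape)

# Fully reduced RMSE: one convenient number for the whole globe.
rmse_global = np.sqrt(np.mean((forecast - truth) ** 2))

# Keep the geospatial dimensions: average over time only, yielding
# one RMSE value per 1.5-degree-by-1.5-degree cell.
rmse_per_cell = np.sqrt(np.mean((forecast - truth) ** 2, axis=0))

print(rmse_global)           # a single scalar
print(rmse_per_cell.shape)   # (121, 240): one RMSE per grid cell
```

The only change between the two quantities is which axes the mean runs over; everything that follows in this piece builds on the per-cell version.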
It is apparent that the model does not perform uniformly well across the globe. This is the pernicious result of using geospatially averaged accuracy as the one and only metric: unfair performance disparities get masked. Given that the accuracy of extreme heat forecasts has a direct effect on mortality, being aware of the relative strengths of different models at individual locations is a matter of life and death. Let's take a look at where in the world GraphCast performs worst.
We find that there are a few notable outliers: Greece (GRC), Bulgaria (BGR), the Republic of North Macedonia (MKD),
Türkiye (TUR), Albania (ALB), Kosovo (XKX), and Namibia (NAM). These are regions where GraphCast performs significantly
worse than in all other territories. This sort of knowledge is important because it can help
decision-makers in those territories judge whether tools like GraphCast are appropriate for their use.
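A territory-level ranking like this can be sketched by aggregating per-cell squared errors under a per-territory grid mask. The masks and error values below are hypothetical placeholders (not real borders or measured errors); note that RMSE must be aggregated from squared errors, not by averaging per-cell RMSE values directly.

```python
import numpy as np

# Hypothetical time-averaged squared error per grid cell on a 121 x 240 grid.
rng = np.random.default_rng(1)
sq_err = rng.gamma(2.0, 1.0, size=(121, 240))

# Placeholder boolean masks mapping territory codes to grid cells;
# real masks would come from actual territory boundaries.
masks = {
    "GRC": np.zeros((121, 240), dtype=bool),
    "NAM": np.zeros((121, 240), dtype=bool),
}
masks["GRC"][55:58, 175:180] = True
masks["NAM"][95:100, 170:176] = True

# One RMSE per territory: mean the squared errors first, then take the root.
rmse_by_territory = {
    code: float(np.sqrt(sq_err[mask].mean())) for code, mask in masks.items()
}
worst_first = sorted(rmse_by_territory, key=rmse_by_territory.get, reverse=True)
print(worst_first)
```

Sorting the resulting dictionary by value surfaces the worst-served territories first, which is exactly the view a single global average hides.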
The results from looking at GraphCast have been interesting, but it is an outstanding question whether these
disparities are an idiosyncrasy of GraphCast or more systemically present in AI weather forecasting models. To
explore this question, we will now consider 5 additional models: Google's Spherical CNN, Keisler's GNN, NeuralGCM,
Huawei's Pangu-Weather, and FuXi. Additionally, let's look at not just their bias by territory, but also when grouping
the territories by their global subregion and income. We will characterize the unfairness in a model by looking at
the greatest difference in RMSE between any two groups for each attribute.
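The unfairness measure described above reduces to a simple max-minus-min over group-level RMSE values. A minimal sketch, with hypothetical income-group numbers (not results from the actual evaluation):

```python
# Hypothetical RMSE (in K) per income group for one model.
rmse_by_income_group = {
    "high": 1.35,
    "upper-middle": 1.52,
    "lower-middle": 1.61,
    "low": 1.74,
}

def max_rmse_gap(rmse_by_group):
    """Greatest difference in RMSE between any two groups of an attribute."""
    values = rmse_by_group.values()
    return max(values) - min(values)

print(round(max_rmse_gap(rmse_by_income_group), 2))  # 0.39
```

The same function applies unchanged whether the groups are territories, global subregions, or income levels; only the grouping of the underlying errors differs.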
We see that it is not just GraphCast! Across all models, and across all three attributes, there are disparities in
model performance. All other model prediction data also comes from WeatherBench 2.
SAFE is a new open-source package I have developed to facilitate
all of the data exploration and fairness assessments I conducted. SAFE was used to gather the
per-attribute RMSE values in this exploration. Overall, the tool empowers
decision-makers with the insight to use the most locally-accurate model for them, and encourages AI
developers to prioritize fairness in their model performance by providing a convenient way to perform stratified
assessments, breaking free of the single-metric paradigm.